IST 718 - Big Data Analytics

Final Project


Professor Lando

Brett Bastianelli, Alex Geiger, Maria Ng, Sam White


Introduction

One of the biggest responsibilities of a government is to collect and organize data on its citizens so that it can make responsible and informed decisions. The data our group uses is the 2018 census data, with key data points such as unemployment, graduation rate, marital status, and median income. These key metrics, along with the additional columns, allow our group to act as a government organization observing the longstanding effects that divorce and high school graduation have on the general population. The findings can help shape educational programs and relief systems for struggling families, so that events such as divorce do not hinder their education.

Read In Data

Census Data Overview

Data Exploration

The following figure shows a boxplot of the distribution of our variable of interest, the high school degree rate. The median falls around 0.9. The distribution is left-skewed, with many outliers below 0.6. We do not remove these outliers, since they are essential to understanding what contributes to low high school degree rates.
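As a rough illustration of how a boxplot flags low values, the sketch below computes the median and the lower whisker fence (Q1 minus 1.5 times the interquartile range) in plain Python. The sample values and function names are hypothetical and are not taken from the census data; the plotting library used for the figure may compute quartiles slightly differently.

```python
from statistics import median, quantiles

def boxplot_summary(rates):
    """Return the median and the lower whisker fence (Q1 - 1.5 * IQR)."""
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    iqr = q3 - q1
    return {"median": median(rates), "lower_fence": q1 - 1.5 * iqr}

def low_outliers(rates):
    """Values below the lower fence; these are flagged, not removed."""
    fence = boxplot_summary(rates)["lower_fence"]
    return [r for r in rates if r < fence]

# Illustrative values only (hypothetical, not the census data).
sample_rates = [0.95, 0.92, 0.90, 0.91, 0.88, 0.93, 0.89, 0.55, 0.40]
summary = boxplot_summary(sample_rates)
flagged = low_outliers(sample_rates)
```

The flagged values are kept in the dataset, in line with the decision above, so that the low-rate areas can be studied rather than discarded.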

States with low graduation rates

When looking more closely at the zip codes with a high school degree rate under 0.6, we see that California, Hawaii, Rhode Island, and Texas have the highest number of such zip codes, normalized by population.

The following figure shows the correlation between each of the candidate variables for the model. Rent, age, family size, and elderly have the highest correlations with high school degree. We can also see that “prep”, “n2”, “numdep”, “total vita”, “pac”, and “elderly” are highly correlated with each other, which means that only one of them can be included in the model. Since “elderly” has the highest correlation with high school degree, we use it in the model.

The following figure shows the correlation with two added variables, property tax and lottery. Neither variable shows a strong correlation with high school graduation rates.

Baseline Models

Three models were used to predict the high school graduation rate. The first was an OLS regression model at the zip code level that took the following input variables: divorced, median age, family median, population, state, not labor force, and elderly. To produce a more accurate model, the high school degree variable was squared. This model resulted in an adjusted R-squared of 0.623. The second was a random forest model that used all numeric input variables to predict the high school degree rate; it had an RMSE of 0.00047. The third was an OLS regression model fit on the dataset grouped by state, using the following variables: divorce, median age, not labor force, unemployed, and property tax. This model resulted in an adjusted R-squared of 0.746; property tax was not significant, with a p-value of 0.196.
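For reference, the adjusted R-squared reported above penalizes R-squared for the number of predictors. The sketch below shows that standard formula along with one possible reading of the squared-target step: fit the model on the squared rate, then take the square root of a prediction to return to the original scale. The back-transformation is an assumption on our part, since the text only states that the variable was squared, and the helper names are hypothetical.

```python
import math

def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def square_target(rates):
    """Transform the target before fitting (assumed reading of the text)."""
    return [r ** 2 for r in rates]

def unsquare_prediction(pred_squared):
    """Map a prediction on the squared scale back to a rate.
    Negative predictions are clipped to zero before the square root."""
    return math.sqrt(max(pred_squared, 0.0))
```

For example, adjusted_r_squared(0.75, 11, 2) evaluates to 0.6875, lower than the raw 0.75 because two predictors were used on only eleven observations.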

Final Analysis Advanced Modeling

The advanced modeling section used Microsoft's LightGBM (LGBM) models to predict the high school graduation rate of a particular census tract. Census tracts were used instead of zip codes in this section because they are smaller than zip codes. On average, every zip code contains 3.792 unique census tracts. The zip code with the highest number of census tracts is in Costa Mesa, California, which is located right outside of Long Beach, California. In total, there are 43 unique tracts located inside of Costa Mesa.
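The tract counts quoted above come from grouping tracts by zip code and counting the distinct tracts in each group. A minimal sketch of that calculation is shown below in plain Python; the zip code and tract identifiers are placeholders, not real codes, and the original notebook may well have used a dataframe group-by instead.

```python
from collections import defaultdict

def tracts_per_zip(pairs):
    """Count the unique census tracts that fall inside each zip code."""
    tracts = defaultdict(set)
    for zip_code, tract_id in pairs:
        tracts[zip_code].add(tract_id)
    return {zip_code: len(ids) for zip_code, ids in tracts.items()}

# Placeholder identifiers for illustration only.
pairs = [("Z1", "T1"), ("Z1", "T2"), ("Z1", "T2"), ("Z2", "T3")]
counts = tracts_per_zip(pairs)
average = sum(counts.values()) / len(counts)  # mean tracts per zip code
busiest = max(counts, key=counts.get)         # zip code with the most tracts
```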

Feature Engineering

Here we added a few more features to the dataset. The first of the four new features is the percent of land, which is the total land area divided by the total area of the tract. Next is the population density, which is calculated by dividing the number of people in the tract by the total land area. Next, the rent-to-income ratio is created by dividing the median rent by the median household income. Lastly
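The three features that are fully described above can be sketched as follows. The dictionary keys (land_area, total_area, population, median_rent, median_household_income) are assumed names for illustration and may not match the column names in the census dataset; the fourth feature is not shown because its definition is not given.

```python
def add_features(tract):
    """Return a copy of a tract record with three engineered features added."""
    enriched = dict(tract)
    # Share of the tract's total area that is land.
    enriched["pct_land"] = tract["land_area"] / tract["total_area"]
    # People per unit of land area.
    enriched["population_density"] = tract["population"] / tract["land_area"]
    # Median rent relative to median household income.
    enriched["rent_to_income"] = (
        tract["median_rent"] / tract["median_household_income"]
    )
    return enriched

# Hypothetical tract record, not taken from the dataset.
example = add_features({
    "land_area": 8.0,
    "total_area": 10.0,
    "population": 4000,
    "median_rent": 1500,
    "median_household_income": 60000,
})
```

Tracts with zero land area or zero income would raise a division error here, so such records need to be handled before this step.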

Data Cleaning

In short, any record that is missing one of our features or the target variable is removed from the training dataset.
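A minimal sketch of this rule is shown below, treating both None and NaN as missing. The column names in the example are hypothetical, and the helper names are our own.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_incomplete(rows, required):
    """Keep only rows that have a value for every required column."""
    return [
        row for row in rows
        if not any(is_missing(row.get(col)) for col in required)
    ]

# Hypothetical records for illustration only.
rows = [
    {"median_age": 35.0, "hs_degree": 0.91},
    {"median_age": None, "hs_degree": 0.80},
    {"median_age": 41.0, "hs_degree": float("nan")},
    {"median_age": 29.0},
]
clean = drop_incomplete(rows, required=["median_age", "hs_degree"])
```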

Model Training and KFold Validation

The model validation technique was K-Fold cross validation, in which the data is separated into K equally sized segments called folds. One of the K folds is then held out and used for testing while the other K-1 folds are used for training. This is repeated so that each fold serves as the test set once; in total, K models are iteratively trained and saved to a data structure of models. K-Fold cross validation was used instead of a single train-test split to ensure that all the data was used for validation.
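The splitting procedure described above can be sketched in plain Python as follows. This is an illustration of the technique rather than the project's code, which may have relied on a library implementation; the function name and the seed parameter are assumptions.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for K-Fold cross validation.

    Indices are shuffled once, then dealt into k folds whose sizes differ
    by at most one. Each fold is used as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = [
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        ]
        yield train_idx, test_idx

splits = list(k_fold_splits(23, k=5, seed=1))
```

One model would be trained per (train, test) pair and appended to a list, so that every record is scored by exactly one model that never saw it during training.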

Modeling Results

In total, five models were trained, one for each of the five folds. Each of these models was evaluated on both the training and testing datasets using several metrics: mean absolute error, mean percent error, and root mean square error. The average score for each metric across the five folds is provided below.
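The metrics named above, and the averaging across folds, can be sketched as follows. We assume here that "mean percent error" refers to the mean absolute percentage error; if the signed version was intended, the abs call in that function would be dropped. All function names are our own.

```python
import math

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_percentage_error(y_true, y_pred):
    # Assumes no true value is zero, since each error is divided by it.
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_square_error(y_true, y_pred):
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def average_over_folds(fold_scores):
    """Average each metric across a list of per-fold score dictionaries."""
    return {
        name: sum(scores[name] for scores in fold_scores) / len(fold_scores)
        for name in fold_scores[0]
    }
```

Each fold would contribute one dictionary of scores (for example with keys "mae", "mape", and "rmse"), and average_over_folds reduces the five dictionaries to a single summary.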

LGBM Feature Importance

Every trained model produced its own importance score for each feature in the model's feature set. Since multiple models were trained, one for every fold, each feature had several scores, which made the results difficult to visualize. Therefore, the seaborn boxplot function was used to summarize the distribution of each feature's scores across folds in a single visual.
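A boxplot shows the spread of each feature's scores across folds. A simpler numeric companion, sketched below, ranks the features by their median importance across folds. The feature names and scores are hypothetical, and the per-fold dictionaries stand in for whatever structure the trained models actually expose.

```python
from statistics import median

def rank_features(fold_importances):
    """Rank features by their median importance across folds.

    fold_importances is a list with one dict per fold, mapping each
    feature name to that fold's importance score.
    """
    names = fold_importances[0].keys()
    medians = {
        name: median(fold[name] for fold in fold_importances)
        for name in names
    }
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores for three folds, for illustration only.
folds = [
    {"median_age": 120, "family_income": 100, "pop_density": 40},
    {"median_age": 130, "family_income": 90, "pop_density": 50},
    {"median_age": 110, "family_income": 95, "pop_density": 45},
]
ranking = rank_features(folds)
```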

The visualization shows that the leading indicators of the model are the median age, followed by the median family income. From the analysis of the data, median age and high school graduation rates are positively correlated: as the median age increases, the graduation rate increases as well. Family income is also positively correlated with high school graduation rates: as the median income increases, so does the high school graduation rate.

Recommendation Introduction

For the project, the team wanted to go beyond predictive statistics and produce recommendations to help struggling communities increase their graduation rates. First, the team needed to identify a specific community struggling to get students to graduate. The census tract the team identified was in Salinas, California, and had one of the lowest graduation rates in the country, at 28%.

Methodology

To inspire recommendations for the struggling community, the team augmented the features of the community and evaluated the impact of each augmentation using the trained models. For instance, if the community were to increase its family income, what impact would that have on high school graduation?

Specifically, the team performed a bivariate sensitivity analysis to quantify how changing two variables impacted the high school graduation rate. In short, a two-dimensional grid was generated, with variable 1 on the x-axis and variable 2 on the y-axis. The value of each grid element was the predicted high school graduation rate for the corresponding values of variable 1 and variable 2. All other features in the model were kept static.
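The grid construction described above can be sketched as follows. The predict argument is any callable that maps a feature dictionary to a predicted rate; in the project this role would be played by the trained models, while toy_predict below is a made-up stand-in used only so the sketch runs on its own. The feature names are likewise hypothetical.

```python
def sensitivity_grid(predict, base_row, var_x, x_values, var_y, y_values):
    """Build a 2-D grid of predictions while varying two features.

    grid[i][j] is the prediction with var_y set to y_values[i] and var_x
    set to x_values[j]; every other feature keeps its value from base_row.
    """
    grid = []
    for y in y_values:
        row = []
        for x in x_values:
            scenario = dict(base_row)  # copy so base_row is never modified
            scenario[var_x] = x
            scenario[var_y] = y
            row.append(predict(scenario))
        grid.append(row)
    return grid

def toy_predict(row):
    """Stand-in model for illustration only, not the trained LGBM model."""
    rate = 0.2 + 0.000004 * row["family_median_income"] + 0.005 * row["median_age"]
    return min(max(rate, 0.0), 1.0)

base = {"family_median_income": 40000, "median_age": 30, "population": 5000}
grid = sensitivity_grid(
    toy_predict, base,
    "family_median_income", [40000, 80000],
    "median_age", [30, 40],
)
```

With several fold models available, one reasonable choice is to pass a predict function that averages their outputs, so the grid reflects the whole ensemble rather than a single fold.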

Results

The following visualization depicts the impact of augmenting the family median income and the median age of residents of Salinas, California. The analysis indicates that median family income is an important driver of the high school graduation rate. Therefore, the team recommends a job fair to bring high-paying jobs to the local area, with the aim of increasing the graduation rate and addressing this systemic problem. The job fair would allow individuals who may not have completed their education to see the many different career opportunities that are available. A large part of the educational and financial disparity in America at the moment seems to be a lack of available information. Individuals may not know which career paths would best fit them and may take the first paying opportunity to support themselves and their families. A career fair would help break that cycle by providing information on which professional opportunities are available and how to apply for them. It appears that, in this instance, money may just solve the problem.

Along with the job fair, developing tax incentives that lead to growth in high-paying career opportunities in the area also shows promise. This recommendation would take more planning, since we would be looking to directly influence the government rather than host a job fair, but it could lead to positive results for those impacted. If large companies were to receive tax credits for hiring individuals without a high school degree, the benefit would be mutual. Individuals who could not complete their education due to extenuating circumstances, such as a parental divorce, would receive a well-paying job that fits their skill set, while the company would receive a tax break at the end of the year. Along with a well-paying job, individuals in the tax incentive program could also receive further funding or help with completing their GED, making them more well-rounded employees and opening the door to further advancement within the company.

Tract Information Stats